AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This success has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a Data Scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
The goals are to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Number of years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to tune different models
from sklearn.model_selection import GridSearchCV
# to compute classification metrics
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score, make_scorer
# to handle warnings
import warnings
warnings.filterwarnings('ignore')
# Define constants for file paths
# and other configurations
FOLDER = 'data/'
FILE = "Loan_Modelling.csv"
# Set the random seed for reproducibility
RS = 1
# Set the test size for train-test split
TEST_SIZE = 0.3
# Load the dataset
df = pd.read_csv(FOLDER + FILE)
# Create a copy of the data
data = df.copy()
# Observe the first few rows of the dataset
data.head()
# Observe the last few rows of the dataset
data.tail()
# shape of the dataset
data.shape
# Observe the data types of the columns
data.info()
# Observe the statistical summary of the dataset
data.describe().T
# Understand Age values
data["Age"].value_counts().sort_values(ascending=False).head(10)
# Understand Experience values
data["Experience"].unique()
# Understand ZipCode values
ZIPCount = data["ZIPCode"].value_counts()
print(f"{ZIPCount.count()} unique ZIP Codes")
print(f"{len(ZIPCount[ZIPCount >= 2])} ZIP Codes with more than 1 entry")
print(f"{len(ZIPCount[ZIPCount == 1])} ZIP Codes with only 1 entry")
# Understand Family categories
data["Family"].value_counts()
# Understand Education categories
data["Education"].value_counts()
# Understand Personal Loan categories
data["Personal_Loan"].value_counts()
data["Securities_Account"].value_counts()
data["CD_Account"].value_counts()
data["Online"].value_counts()
data["CreditCard"].value_counts()
# Check for duplicates
data.duplicated().sum()
The ID variable serves as a unique customer identifier with no intrinsic predictive value, so we will remove this feature to simplify the machine learning pipeline.
data = data.drop(columns=['ID'])
data.head()
During exploratory data analysis, we identified negative values in the experience variable. Since negative experience years are logically impossible, these values likely represent data entry errors or measurement artifacts.
To address this issue, we applied the absolute value transformation to ensure all experience values are non-negative. This approach preserves the magnitude of the recorded values while correcting the invalid negative signs.
# use abs function to convert negative values to positive
data["Experience"] = data["Experience"].abs()
data["Experience"].unique()
Analysis of the ZIP Code variable revealed 467 unique values, comprising:
- 456 ZIP Codes with multiple occurrences
- 11 ZIP Codes appearing only once
Given the high cardinality of this feature, we will reduce dimensionality by utilizing only the first three digits of each ZIP Code. This approach maintains meaningful geographic information while significantly decreasing category count, as the initial digits represent:
| Digit(s) | Purpose | Example (90210) |
|---|---|---|
| 1st | Region (0-9) | 9 = West (CA, HI, AK) |
| 2nd-3rd | Sectional Center (mail hub) | 02 = Beverly Hills area |
| 4th-5th | Delivery zone | 10 = Specific part of Beverly Hills |
# Evaluate whether using the first 3 digits of ZIPCode reduces the number of categories significantly.
ZIPdata = data["ZIPCode"].astype(str)
ZIPdata = ZIPdata.str[0:3]
ZIPdata.nunique()
Using the first 3 digits of the ZIP Code reduces the number of categories from 467 to 57, which is still too many categories for the model to handle comfortably.
# Evaluate whether using the first 2 digits of ZIPCode reduces the number of categories significantly.
ZIPdata = data["ZIPCode"].astype(str)
ZIPdata = ZIPdata.str[0:2]
ZIPdata.nunique()
Using the first 2 digits reduces the count from 467 to 7 categories. This number is far more manageable while still giving the customer segmentation geographic meaning.
# Implementing the first 2 digits in ZIPCode
# to reduce the number of categories
data["ZIPCode"] = data["ZIPCode"].astype(str)
data["ZIPCode"] = data["ZIPCode"].str[:2]
Several numeric variables in the dataset (e.g., Education and the binary indicator flags) actually represent discrete categories rather than continuous values. Converting them to the categorical type improves EDA effectiveness and prevents analytical errors: it enables proper visualization of frequency distributions and avoids misleading statistical summaries.
columns = ["ZIPCode", "Education", "Personal_Loan", "Securities_Account", "CD_Account", "Online", "CreditCard",]
data[columns] = data[columns].astype("category")
# Observe the data types of the columns after data preparation
data.info()
After conversion, the dataframe has 7 category variables, 5 integer variables, and 1 float variable.
# Statistical summary of the dataset after data preparation
data.describe().T
def plot_numerical_distributions(data, numerical_features, hue=None, figsize=(16, 12), dpi=100, bins=30, palette="viridis", plot_style="whitegrid", show_median=True):
# Set style and palette
sns.set_style(plot_style)
sns.set_palette(palette)
# Calculate grid dimensions
n_cols = 2
n_rows = (len(numerical_features) // n_cols) + (1 if len(numerical_features) % n_cols else 0)
# Create figure
fig = plt.figure(figsize=figsize, dpi=dpi)
fig.suptitle('Distribution Analysis', y=1.02, fontsize=16, fontweight='bold')
for i, feature in enumerate(numerical_features, 1):
ax = plt.subplot(n_rows, n_cols, i)
# Plot histogram with KDE
sns.histplot(data=data, x=feature, hue=hue, kde=True, bins=bins, edgecolor='white', linewidth=0.5, stat='count', alpha=0.7)
# Add median line
if show_median:
median = data[feature].median()
ax.axvline(median, color='red', linestyle='--', linewidth=1.5)
ax.text(median*1.05, ax.get_ylim()[1]*0.9,
f'Median: {median:.1f}', color='red')
# Formatting
ax.set_title(f'{feature} Distribution', pad=10)
ax.set_xlabel('')
ax.grid(alpha=0.3)
plt.tight_layout()
return fig
def plot_categorical_distributions(data, categorical_features, hue="",figsize=(16, 18), palette="viridis", plot_style="whitegrid", annotate=True, rotate_labels=False):
# Set visual style
sns.set_style(plot_style)
plt.rcParams['font.size'] = 12
# Calculate grid dimensions
n_cols = 2
n_rows = (len(categorical_features) // n_cols) + (1 if len(categorical_features) % n_cols else 0)
# Initialize figure
fig = plt.figure(figsize=figsize, dpi=100, layout="constrained")
fig.suptitle('Distribution Analysis', fontsize=18, fontweight='bold', y=1.02)
# Create subplots
for idx, feature in enumerate(categorical_features, start=1):
ax = fig.add_subplot(n_rows, n_cols, idx)
# Plot with improved aesthetics
if hue:
# If hue is provided, use countplot with hue
plot = sns.countplot(data=data, x=feature, hue=hue, palette=palette, edgecolor='black', linewidth=0.5, alpha=0.85)
else:
# If no hue is provided, use countplot without hue
plot = sns.countplot(data=data, x=feature, palette=palette, edgecolor='black', linewidth=0.5, alpha=0.85, order=data[feature].value_counts().index) # Sort by frequency
# Add count annotations
if annotate:
for p in ax.patches:
ax.annotate(
f'{p.get_height():,.0f}',
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center',
xytext=(0, 5),
textcoords='offset points',
fontsize=10
)
# Rotate labels if needed
if rotate_labels:
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
# Format subplot
ax.set_title(f'{feature} Distribution', pad=12, fontweight='semibold')
ax.set_xlabel('')
ax.set_ylabel('Count', fontsize=12)
ax.grid(visible=True, alpha=0.2, axis='y')
# Remove top/right spines
sns.despine(ax=ax, top=True, right=True)
return fig
def plot_numerical_boxplots(data, numerical_features, figsize=(16, 12), palette="viridis", plot_style="whitegrid", showfliers=True, notch=False, orient="v", hue=None):
# Set visual style
sns.set_style(plot_style)
plt.rcParams['font.size'] = 12
# Calculate grid dimensions
n_cols = 2
n_rows = (len(numerical_features) // n_cols) + (1 if len(numerical_features) % n_cols else 0)
# Initialize figure
fig = plt.figure(figsize=figsize, dpi=100, layout="constrained")
fig.suptitle('Numerical Features Distribution Analysis (Boxplots)', fontsize=18, fontweight='bold', y=1.02)
# Create subplots
for idx, feature in enumerate(numerical_features, start=1):
ax = fig.add_subplot(n_rows, n_cols, idx)
# Plot boxplot with enhanced aesthetics
sns.boxplot(
data=data,
x=feature if orient == "v" else None,
y=None if orient == "v" else feature,
hue=hue,
palette=palette,
width=0.6,
showfliers=showfliers,
notch=notch,
linewidth=1.5,
fliersize=3,
saturation=0.8
)
# Add mean marker
mean_val = data[feature].mean()
if orient == "v":
ax.axvline(mean_val, color='red', linestyle='--', linewidth=1.2, label=f'Mean: {mean_val:.1f}')
else:
ax.axhline(mean_val, color='red', linestyle='--', linewidth=1.2, label=f'Mean: {mean_val:.1f}')
ax.legend(loc='upper right')
# Format subplot
ax.set_title(f'{feature} Distribution', pad=12, fontweight='semibold')
if orient == "v":
ax.set_ylabel('Value', fontsize=12)
else:
ax.set_xlabel('Value', fontsize=12)
ax.grid(visible=True, alpha=0.2, axis='y' if orient == "v" else 'x')
# Remove top/right spines
sns.despine(ax=ax, top=True, right=True)
# Add figure caption
caption = "Note: Red dashed line shows mean value. Box shows IQR (25th-75th percentile)"
if showfliers:
caption += ". Points show outliers."
if hue:
caption += f" Grouped by {hue}."
fig.text(
0.5, 0.01,
caption,
ha='center',
fontsize=12,
color='dimgray'
)
return fig
# Plot Histograms for numerical variables
numerical = ["Age", "Experience", "Income", "CCAvg", "Mortgage"]
fig = plot_numerical_distributions(data, numerical)
# Plot boxplots for numerical variables
with warnings.catch_warnings():
warnings.simplefilter("ignore")
fig = plot_numerical_boxplots(data=data, numerical_features=numerical)
plt.show()
# Countplot for Family variable
sns.countplot(data=data, x="Family", palette="viridis", edgecolor='black', linewidth=0.5, alpha=0.85, stat='count')
plt.title("Family Distribution", pad=12, fontweight='semibold')
plt.xlabel('Family Size', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.grid(visible=True, alpha=0.2, axis='y')
plt.show()
# Boxplot for Family variable
sns.boxplot(data=data, x="Family", palette="viridis")
plt.title("Family Distribution", pad=12, fontweight='semibold')
plt.xlabel('Family Size', fontsize=12)
plt.grid(visible=True, alpha=0.2, axis='x')
plt.show()
# Barplots for categorical variables
categorical = ["ZIPCode", "Education", "Personal_Loan", "Securities_Account", "CD_Account", "Online", "CreditCard",]
with warnings.catch_warnings():
warnings.simplefilter("ignore")
fig = plot_categorical_distributions(data, categorical)
plt.show()
# Understand the distribution of the numerical data.
sns.pairplot(data, hue="Personal_Loan", diag_kind="kde", palette="viridis", height=2.5)
plt.show()
# Observe the suspected strongest predictors distribution against Education.
sns.relplot(x="Income", y="CCAvg", hue="Personal_Loan", col="Education", data=data, palette="viridis");
sns.relplot(x="Income", y="CCAvg", hue="Personal_Loan", col="Family", data=data, palette="viridis");
# Generate a correlation matrix
data.corr(numeric_only=True)
# Plot the correlation matrix
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="seismic", center=0, fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
Analyzing how categorical and numerical features interact with Personal Loan reveals critical insights into loan adoption patterns.
# Plot the frequency of Personal Loan vs categorical variables
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig = plot_categorical_distributions(data, categorical, hue="Personal_Loan", palette="rocket")
plt.show()
Education Level & Personal Loan
Securities Account & Personal Loan
CD Account & Personal Loan
Online Banking & Personal Loan
Credit Card Ownership & Personal Loan
# Plot the distribution of Personal Loan vs numerical variables
fig = plot_numerical_distributions(data, numerical, hue="Personal_Loan", palette="tab10")
# Plot boxplots for numerical variables
fig = plot_numerical_boxplots(data=data, numerical_features=numerical + ["Family"], hue="Personal_Loan", palette="rocket");
plt.suptitle('Numerical Features Distribution Analysis (Boxplots) by Personal Loan', fontsize=18, fontweight='bold', y=1.02)
plt.show()
# Family vs Personal Loan
fig = plt.figure(figsize=(8, 6), dpi=100)
sns.countplot(data=data, x="Family", hue="Personal_Loan", palette="viridis", edgecolor='black', linewidth=0.5, alpha=0.85, stat='count')
plt.title("Family Distribution by Personal Loan", pad=12, fontweight='semibold')
plt.xlabel('Family Size', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.grid(visible=True, alpha=0.2, axis='y')
plt.legend(title="Personal Loan", loc='upper right')
plt.show()
Age & Personal Loan
Experience & Personal Loan
Income & Personal Loan
Credit Card Average (CCAvg) & Personal Loan
Mortgage & Personal Loan
Family Size & Personal Loan
1. What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
2. How many customers have credit cards?
3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
4. How does a customer's interest in purchasing a loan vary with their age?
5. How does a customer's interest in purchasing a loan vary with their education?
# checking for null values
data.isnull().sum()
The data has no missing values.
results = []
outlier_columns = ["Income", "CCAvg", "Mortgage"]
for col in outlier_columns:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = data[col].quantile(0.25)
Q3 = data[col].quantile(0.75)
# Calculate IQR
IQR = Q3 - Q1
# Determine lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
# Store results
results.append({
'Column': col,
'Lower Bound': lower_bound,
'Upper Bound': upper_bound,
'Outlier Count': len(outliers),
'Outlier Percentage (%)': (len(outliers) / len(data)) * 100,
})
outlier_df = pd.DataFrame(results)
outlier_df
The correlation analysis showed that Age and Experience are 99% correlated. One of these variables can therefore be dropped to simplify the model without losing meaningful information. Age is the better one to keep, as it provides more complete information about the customer.
# Drop Experience column
data = data.drop(columns=["Experience"], axis=1)
data
# List categorical columns
print(categorical)
From the categorical variables, we need to decide which ones require dummy variables:
# Create dummy variables for ZIPCode and Education
data_dummy = pd.get_dummies(data, columns=["ZIPCode", "Education"], drop_first=True)
data_dummy.head()
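The `drop_first=True` choice above can be illustrated with a minimal toy frame (hypothetical values, not the bank's data): with k category levels only k-1 indicator columns are needed, since the dropped level acts as the implicit baseline.

```python
import pandas as pd

# Toy frame with three Education levels (illustrative only)
toy = pd.DataFrame({"Education": [1, 2, 3, 2]})

# drop_first=True omits Education_1; a row with both remaining
# indicators at 0 is implicitly the baseline level 1
dummies = pd.get_dummies(toy, columns=["Education"], drop_first=True)
print(dummies.columns.tolist())  # ['Education_2', 'Education_3']
```

Keeping all k indicators would make the columns linearly dependent, which is redundant for tree models and harmful for linear ones.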
In the constant definitions above, the test size was set to 0.30 and the random seed to 1.
# Define Features and Target
X = data_dummy.drop(columns=["Personal_Loan"], axis=1)
y = data_dummy["Personal_Loan"]
# Observe features
X.head()
# Observe target
y.head()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RS, stratify=y)
# Observe the shape of the training and testing sets
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test: {y_test.shape}")
The business wants to maximize the conversion of customers to borrowers. Therefore we need to minimize the number of False Negatives (customers predicted not to accept a loan who actually would). Recall will be used as the model evaluation criterion.
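As a quick illustration of the metric choice on toy labels (hypothetical, not the bank's data), recall penalizes exactly the false negatives we want to avoid, while precision penalizes false positives:

```python
from sklearn.metrics import recall_score, precision_score

# Toy ground truth and predictions (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # four customers actually accepted a loan
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # the model missed two of them

# Recall = TP / (TP + FN) = 2 / (2 + 2)
print(recall_score(y_true, y_pred))     # 0.5
# Precision = TP / (TP + FP) = 2 / (2 + 1)
print(precision_score(y_true, y_pred))
```

A model tuned for recall would rather flag a few extra customers (lower precision) than miss ones who would have converted.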
The function below returns a dataframe with performance metrics (Accuracy, Recall, Precision, and F1). Imported from the Hands-On Decision Tree notebook.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
Function that returns the confusion matrix:
def plot_confussion_matrix(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predict the target values using the provided model and predictors
y_pred = model.predict(predictors)
# Compute the confusion matrix comparing the true target values with the predicted values
cm = confusion_matrix(target, y_pred)
# Create labels for each cell in the confusion matrix with both count and percentage
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2) # reshaping to a matrix
# Set the figure size for the plot
plt.figure(figsize=(6, 4))
# Plot the confusion matrix as a heatmap with the labels
sns.heatmap(cm, annot=labels, fmt="")
# Add a label to the y-axis
plt.ylabel("True label")
# Add a label to the x-axis
plt.xlabel("Predicted label")
# Decision Tree Classifier
first_tree = DecisionTreeClassifier(random_state=RS)
# Fit the model
first_tree.fit(X_train, y_train)
# Evaluate the model on the training set
first_tree_train_performance = model_performance_classification(first_tree, X_train, y_train)
print("Train Performance")
print(first_tree_train_performance)
# Evaluate the model on the testing set
first_tree_test_performance = model_performance_classification(first_tree, X_test, y_test)
print("Test Performance")
print(first_tree_test_performance)
# Plot the confusion matrix for the training set
plot_confussion_matrix(first_tree, X_train, y_train)
plt.title("Confusion Matrix - Train Set")
plt.show()
# Plot the confusion matrix for the testing set
plot_confussion_matrix(first_tree, X_test, y_test)
plt.title("Confusion Matrix - Test Set")
plt.show()
As noted earlier, about 90% of customers did not accept the last personal loan campaign, so the first model might be biased toward the dominant class. A second model using balanced class weights will be built and compared.
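A minimal sketch of what `class_weight="balanced"` does under the hood: each class weight is `n_samples / (n_classes * n_samples_in_class)`, so with roughly 90% negatives the positive class is upweighted about 9x relative to the negative class. A toy check with sklearn's helper (illustrative 90/10 class counts, not the actual split):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 90/10 imbalance mirroring the campaign conversion rate
y = np.array([0] * 90 + [1] * 10)

# "balanced" weight per class = n_samples / (n_classes * bincount(class))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # {0: 0.555..., 1: 5.0}
```

The tree then weights misclassified minority samples more heavily when computing split impurity, pushing it to separate loan acceptors even though they are rare.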
# Initialize a second Decision Tree Classifier with balanced class weights
second_tree = DecisionTreeClassifier(random_state=RS, class_weight="balanced")
# Fit the model
second_tree.fit(X_train, y_train)
# Compare the performance of the second tree with the first tree
print("First Decision Tree Performance - Train Performance")
print(first_tree_train_performance)
print("First Decision Tree Performance - Test Performance")
print(first_tree_test_performance)
print("-"*50)
second_tree_train_performance = model_performance_classification(second_tree, X_train, y_train)
print("Balanced Class Decision Tree - Train Performance")
print(second_tree_train_performance)
second_tree_test_performance = model_performance_classification(second_tree, X_test, y_test)
print("Balanced Class Decision Tree - Test Performance")
print(second_tree_test_performance)
# Plot the confusion matrix for the train set
plot_confussion_matrix(second_tree, X_train, y_train)
plt.title("Balanced - Confusion Matrix - Balanced Class - Train Set")
plt.show()
# Plot the confusion matrix for the testing set
plot_confussion_matrix(second_tree, X_test, y_test)
plt.title("Balanced Confusion Matrix - Balanced Class - Test Set")
plt.show()
# Compare the number of leaves and depth of the trees
print("-"*50)
print("Decision Tree without class weights")
print("Number of leaves: ", first_tree.get_n_leaves())
print("Depth of the tree: ", first_tree.get_depth())
print("-"*50)
print("Decision Tree with class weights")
print("Number of leaves: ", second_tree.get_n_leaves())
print("Depth of the tree: ", second_tree.get_depth())
print("-"*50)
Both trees perform perfectly on the train data, with every evaluation metric equal to 1, which suggests overfitting. On the test data, the balanced-class tree achieves slightly better recall, while accuracy, precision, and F1 are essentially unchanged.
Further visualization will use the first tree, as there is no significant improvement.
# Plot the decision tree
columns = X_train.columns
plt.figure(figsize=(20, 20))
first_tree_plot = tree.plot_tree(first_tree, feature_names=columns, filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree - Train Set")
plt.show()
# printing the text report of the decision tree
print(tree.export_text(first_tree, feature_names=columns, show_weights=True))
# Obtain the feature importances from the first tree
feature_importance = pd.DataFrame(first_tree.feature_importances_, index=columns, columns=["Importance"]).sort_values("Importance", ascending=False)
feature_importance["Importance %"] = (feature_importance["Importance"] * 100).round(2)
feature_importance
# Plot the feature importances
plt.title("Feature Importance - Decision Tree")
plt.xlabel("Importance %")
plt.ylabel("Features")
with warnings.catch_warnings():
warnings.simplefilter("ignore")
sns.barplot(x=feature_importance["Importance %"], y=feature_importance.index, palette="viridis")
plt.show()
Observations:
# Initialize a Decision Tree Classifier
estimator = DecisionTreeClassifier(random_state=RS)
# Define the parameter grid for GridSearchCV
parameters = {
"class_weight": ["balanced", None],
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Define the scoring metric
scorer = make_scorer(recall_score)
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=estimator, param_grid=parameters, scoring=scorer, cv=5)
grid_search.fit(X_train, y_train)
estimator = grid_search.best_estimator_
print("Best parameters from GridSearchCV:")
print(grid_search.best_params_)
print("Best Recall from GridSearchCV:")
print(grid_search.best_score_)
# Fit the model with the best parameters
estimator.fit(X_train, y_train)
The pre-pruned tree is less complex than both the unbalanced and balanced models, and its cross-validated recall of 0.98 is much better than the test recall of 0.86 (unbalanced) and 0.87 (balanced).
However, a depth of 2 might cause underfitting, so it will be important to evaluate how this model performs on the test data.
# fit the model with the best parameters
pre_tree = estimator
pre_tree.fit(X_train, y_train)
pre_tree_train_performance = model_performance_classification(pre_tree, X_train, y_train)
print("Train Performance")
print(pre_tree_train_performance)
pre_tree_test_performance = model_performance_classification(pre_tree, X_test, y_test)
print("Test Performance")
print(pre_tree_test_performance)
# Plot the pre-pruning confusion matrix for the training set
plot_confussion_matrix(pre_tree, X_train, y_train)
plt.title("Pre-Pruning - Recall - Confusion Matrix - Train Set")
plt.show()
# Plot the pre-pruning confusion matrix for the testing set
plot_confussion_matrix(pre_tree, X_test, y_test)
plt.title("Pre-Pruning - Recall - Confusion Matrix - Test Set")
plt.show()
Pre-Pruned Decision Tree Confusion Matrix - Test Set
True Positives (TP): 144 (9.60%)
Perfect Recall: With no false negatives, the pre-pruned model captures all actual positive cases, achieving a recall of 100%. This means the model is very aggressive in flagging potential positives.
# Compare the performance of the first, second and pre-pruned trees
print("First Decision Tree Performance - Train Performance")
print(first_tree_train_performance)
print("-"*50)
print("First Decision Tree Performance - Test Performance")
print(first_tree_test_performance)
print("-"*50)
second_tree_train_performance = model_performance_classification(second_tree, X_train, y_train)
print("Balanced Class Decision Tree - Train Performance")
print(second_tree_train_performance)
print("-"*50)
second_tree_test_performance = model_performance_classification(second_tree, X_test, y_test)
print("Balanced Class Decision Tree - Test Performance")
print(second_tree_test_performance)
print("-"*50)
print("Pre-Prunning Decision Tree - Train Performance")
print(pre_tree_train_performance)
print("-"*50)
print("Pre-Prunning Decision Tree - Test Performance")
print(pre_tree_test_performance)
print("-"*50)
| Model Type | Dataset | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| First Decision Tree | Train | 1.0 | 1.0 | 1.0 | 1.0 |
| First Decision Tree | Test | 0.981333 | 0.861111 | 0.939394 | 0.898551 |
| Balanced Class Decision Tree | Train | 1.0 | 1.0 | 1.0 | 1.0 |
| Balanced Class Decision Tree | Test | 0.980667 | 0.875 | 0.919708 | 0.896797 |
| Pre-Pruning Decision Tree | Train | 0.8 | 1.0 | 0.324324 | 0.489796 |
| Pre-Pruning Decision Tree | Test | 0.816 | 1.0 | 0.342857 | 0.510638 |
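The pre-pruned model's test metrics can be verified by hand from its confusion-matrix counts reported above (TP = 144, FP = 276, FN = 0):

```python
# Test-set confusion-matrix counts for the pre-pruned tree (from the matrices above)
tp, fp, fn = 144, 276, 0

recall = tp / (tp + fn)       # no positive customer is missed
precision = tp / (tp + fp)    # but many campaign contacts would be wasted
print(recall, round(precision, 6))  # 1.0 0.342857
```

This makes the trade-off concrete: perfect recall is bought at the cost of contacting roughly two non-converting customers for every converting one.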
# Plot the pre-pruned Decision Tree
plt.figure(figsize=(10, 10))
pre_tree_plot = tree.plot_tree(pre_tree, feature_names=columns, filled=True, rounded=True, fontsize=10)
plt.title("Pre-Pruning - Recall - Decision Tree - Best Estimator - Train Set")
plt.show()
# The text report of the decision tree
print(tree.export_text(pre_tree, feature_names=columns, show_weights=True))
# Obtain the feature importances from the pre-pruned tree
feature_importance = pd.DataFrame(pre_tree.feature_importances_, index=columns, columns=["Importance"]).sort_values("Importance", ascending=False)
feature_importance["Importance %"] = (feature_importance["Importance"] * 100).round(2)
feature_importance
# Plot the feature importances
plt.title("Feature Importance - Pre-Prunned Decision Tree - Best Estimator")
plt.xlabel("Importance %")
plt.ylabel("Features")
with warnings.catch_warnings():
warnings.simplefilter("ignore")
sns.barplot(x=feature_importance["Importance %"], y=feature_importance.index, palette="viridis")
plt.show()
# Initialize a post-pruning Decision Tree Classifier
post_tree = DecisionTreeClassifier(random_state=RS)
path = post_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
# Observe alphas and impurities
pd.DataFrame(path)
# Plot the effective alphas
plt.figure(figsize=(10, 6))
plt.plot(ccp_alphas, impurities, marker="o", drawstyle="steps-post")
plt.title("Effective Alphas vs. Impurity")
plt.xlabel("Effective alpha")
plt.ylabel("Impurity")
plt.grid()
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
# Create a new Decision Tree Classifier with the current alpha
clf = DecisionTreeClassifier(random_state=RS, ccp_alpha=ccp_alpha, class_weight="balanced")
# Fit the model to the training data
clf.fit(X_train, y_train)
# Append the classifier to the list
clfs.append(clf)
# Print the number of nodes in each tree
for i, clf in enumerate(clfs):
print(f"CCP Alpha: {ccp_alphas[i]:.4f}, Number of nodes: {clf.tree_.node_count}")
print("-"*50)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depths = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_title("Number of nodes vs. ccp_alpha")
ax[0].set_xlabel("ccp_alpha")
ax[0].set_ylabel("Number of nodes")
ax[1].plot(ccp_alphas, depths, marker="o", drawstyle="steps-post")
ax[1].set_title("Depth vs. ccp_alpha")
ax[1].set_xlabel("ccp_alpha")
ax[1].set_ylabel("Depth")
plt.show()
# Obtain the recall scores for each tree with train data
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
# Obtain the recall scores for each tree with test data
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
# Note: clf.score returns accuracy, not recall; kept separately for reference
train_accuracy_scores = [clf.score(X_train, y_train) for clf in clfs]
test_accuracy_scores = [clf.score(X_test, y_test) for clf in clfs]
# Plot alpha vs recall
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# define parameters for GridSearchCV
params = {
"ccp_alpha": ccp_alphas,
"class_weight": ["balanced", None],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=RS), params, scoring="recall", cv=5)
grid.fit(X_train, y_train)
best_clf = grid.best_estimator_
print("Best parameters from GridSearchCV:")
print(grid.best_params_)
print("Best Recall from GridSearchCV:")
print(grid.best_score_)
print(f"Max depth: {best_clf.get_depth()}")
print(f"Number of Leaves: {best_clf.get_n_leaves()}")
# Plot the confusion matrix for the training set
plot_confussion_matrix(best_clf, X_train, y_train)
plt.title("Post-Prunned - Recall - Confusion Matrix - Train Set")
plt.show()
# Plot the confusion matrix for the testing set
plot_confussion_matrix(best_clf, X_test, y_test)
plt.title("Post-Prunned - Recall - Confusion Matrix - Test Set")
plt.show()
Post-Pruned Decision Tree Confusion Matrix - Test Set
True Positives (TP): 142 (9.47%)
After pruning, the model significantly reduces misclassifications: false positives drop sharply from 276 in the pre-pruned model to 95. Although two false negatives appear (FN = 2), the overall precision improves.
# Get the performance of the best classifier on the training set
post_tree_train_performance = model_performance_classification(best_clf, X_train, y_train)
# Get the performance of the best classifier on the testing set
post_tree_test_performance = model_performance_classification(best_clf, X_test, y_test)
# Compare the performance of the first, second, pre-pruned and post-pruned trees
print("First Decision Tree - Train Performance")
print(first_tree_train_performance)
print("-"*50)
print("First Decision Tree - Test Performance")
print(first_tree_test_performance)
print("-"*50)
second_tree_train_performance = model_performance_classification(second_tree, X_train, y_train)
second_tree_test_performance = model_performance_classification(second_tree, X_test, y_test)
print("Second Decision Tree - Train Performance")
print(second_tree_train_performance)
print("-"*50)
print("Second Decision Tree - Test Performance")
print(second_tree_test_performance)
print("-"*50)
print("Pre-Pruned Decision Tree - Train Performance")
print(pre_tree_train_performance)
print("-"*50)
print("Pre-Pruned Decision Tree - Test Performance")
print(pre_tree_test_performance)
print("-"*50)
print("Post-Pruned Decision Tree - Train Performance")
print(post_tree_train_performance)
print("-"*50)
print("Post-Pruned Decision Tree - Test Performance")
print(post_tree_test_performance)
print("-"*50)
# Plot the Post-Pruned Decision Tree
plt.figure(figsize=(20, 20))
post_tree_plot = tree.plot_tree(best_clf, feature_names=columns, filled=True, rounded=True, fontsize=10)
plt.title("Post-Pruned Decision Tree - Best Estimator")
plt.show()
# printing the text report of the decision tree
print(tree.export_text(best_clf, feature_names=columns, show_weights=True))
# Obtain the feature importances from the post-pruned tree
feature_importance = pd.DataFrame(best_clf.feature_importances_, index=columns, columns=["Importance"]).sort_values("Importance", ascending=False)
feature_importance["Importance %"] = (feature_importance["Importance"] * 100).round(2)
feature_importance
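A property worth noting: `feature_importances_` holds impurity-based importances that always sum to 1 for a fitted tree, so multiplying by 100, as above, yields true percentages. A minimal sketch on synthetic data (column names here are illustrative):

```python
# Sketch: impurity-based importances sum to 1, so * 100 gives percentages.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2, random_state=3)
cols = [f"f{i}" for i in range(5)]  # hypothetical feature names
clf = DecisionTreeClassifier(max_depth=4, random_state=3).fit(X, y)

imp = (
    pd.DataFrame(clf.feature_importances_, index=cols, columns=["Importance"])
    .sort_values("Importance", ascending=False)
)
imp["Importance %"] = (imp["Importance"] * 100).round(2)
```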
# Plot the feature importances
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # seaborn warns when palette is passed without hue
    sns.barplot(x=feature_importance["Importance %"], y=feature_importance.index, palette="viridis")
plt.title("Feature Importance - Post-Pruned Decision Tree - Best Estimator")
plt.xlabel("Importance %")
plt.ylabel("Features")
plt.show()
# Comparing models performance with Recall as evaluation metric
models_train_comparison_recall = pd.concat(
    [
        first_tree_train_performance.T,
        second_tree_train_performance.T,
        pre_tree_train_performance.T,
        post_tree_train_performance.T,
    ],
    axis=1,
)
models_train_comparison_recall.columns = [
    "First Tree without class weight",
    "Second Tree with class weight",
    "Pre Pruning",
    "Post Pruning",
]
print("Train Performance Comparison")
models_train_comparison_recall
# Comparing models performance with Recall as evaluation metric
models_test_comparison_recall = pd.concat(
    [
        first_tree_test_performance.T,
        second_tree_test_performance.T,
        pre_tree_test_performance.T,
        post_tree_test_performance.T,
    ],
    axis=1,
)
models_test_comparison_recall.columns = [
    "First Tree without class weight",
    "Second Tree with class weight",
    "Pre Pruning",
    "Post Pruning",
]
print("Test Performance Comparison")
models_test_comparison_recall
# Compare the complexity of the models with Recall as evaluation metric
print("-"*50)
print("Model Complexity Comparison")
print("Decision Tree without class weights")
print("Number of leaves: ", first_tree.get_n_leaves())
print("Depth of the tree: ", first_tree.get_depth())
print("-"*50)
print("Decision Tree with class weights")
print("Number of leaves: ", second_tree.get_n_leaves())
print("Depth of the tree: ", second_tree.get_depth())
print("-"*50)
print("Pre-Pruning Decision Tree - Recall")
print("Number of leaves: ", pre_tree.get_n_leaves())
print("Depth of the tree: ", pre_tree.get_depth())
print("-"*50)
print("Post-Pruning Decision Tree - Recall")
print("Number of leaves: ", best_clf.get_n_leaves())
print("Depth of the tree: ", best_clf.get_depth())
print("-"*50)
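The complexity numbers printed above reflect a general property: fitting with a positive `ccp_alpha` can only prune the fully grown tree, never enlarge it, so leaf count and depth are non-increasing. A self-contained sketch on synthetic data (the alpha value is illustrative):

```python
# Sketch: pruning with ccp_alpha > 0 never increases tree complexity.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=4)
full = DecisionTreeClassifier(random_state=4).fit(X, y)
pruned = DecisionTreeClassifier(random_state=4, ccp_alpha=0.01).fit(X, y)

print("full  :", full.get_n_leaves(), "leaves, depth", full.get_depth())
print("pruned:", pruned.get_n_leaves(), "leaves, depth", pruned.get_depth())
```

This is why the post-pruned tree above is both shallower and smaller than the unpruned baselines while keeping comparable recall.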
Insights and Recommendations for the Post-Pruned Decision Tree
Insights and Recommendations for the Pre-Pruned Tree
Final Conclusions